pandas data cleaning for web scraping is one of the fastest ways to turn messy crawler outputs into analysis-ready datasets. In this chapter, you will learn how to extract tables from web pages, convert text fields into numeric values, handle missing and duplicate rows, and merge multiple datasets efficiently.
Before we start, note that each tutorial in this series is independent, so you can follow this chapter on its own. If you want the full learning path, you can review the earlier chapters in this series.
Outbound references (authoritative docs):
read_html docs: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
pandas docs: https://pandas.pydata.org/docs/
9. pandas data processing
pandas is one of the most practical libraries in the Python ecosystem for cleaning, transforming, and merging crawler data. If you master it, you will process data faster and more consistently than you would with spreadsheets.
Install pandas
pip install pandas
9.1 Extract HTML tables with read_html
In many scraping scenarios, the page already contains structured tables. Instead of manually parsing HTML, you can let pandas extract tables directly.
For example, you can read tables from Wikipedia (note that read_html needs an HTML parser such as lxml installed):
import pandas as pd
tables = pd.read_html("https://en.wikipedia.org/wiki/Minnesota")
print(f"Total tables: {len(tables)}")
If you only want a specific table, you can match by table title text:
tables = pd.read_html(
"https://en.wikipedia.org/wiki/Minnesota",
match="Election results from statewide races"
)
df = tables[0]
df.head()
Next, check column types:
df.info()
9.2 Convert percentage strings to numeric values
Often, scraped numbers come with symbols such as %. As a result, direct conversion will fail:
df["GOP"].astype("float")
You can remove % first and then convert:
df["GOP"].replace({"%": ""}, regex=True).astype("float")
If you want to convert multiple columns at once, remove % globally and then apply numeric conversion:
df = df.replace({"%": ""}, regex=True)
df[["GOP", "DFL", "Others"]] = df[["GOP", "DFL", "Others"]].apply(pd.to_numeric)
df.info()
This step is typical in pandas data cleaning for web scraping, because websites often format numbers for humans rather than for analysis.
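If some scraped cells still contain stray text such as footnote markers or placeholders, pd.to_numeric with errors="coerce" turns them into NaN instead of raising an error. A minimal sketch with made-up values:
import pandas as pd

# Hypothetical scraped values: a footnote marker and a placeholder survive the "%" removal
s = pd.Series(["44.9%", "52.4%[a]", "n/a"])
cleaned = pd.to_numeric(s.str.replace("%", "", regex=False), errors="coerce")
print(cleaned)  # 44.9, NaN, NaN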
9.3 Handle missing values (NaN)
Missing values can distort metrics and aggregations. Therefore, you should detect them early.
Detect missing values with isnull():
import pandas as pd
df = pd.DataFrame({
"A": [1, 2, None, 4],
"B": [None, 2, 3, 4],
"C": [1, None, None, 4]
})
print(df.isnull())
Count missing values per column with isnull().sum():
print(df.isnull().sum())
Remove missing rows or columns with dropna():
df_cleaned_rows = df.dropna() # drop rows with any NaN
df_cleaned_cols = df.dropna(axis=1) # drop columns with any NaN
In practice, you often combine dropping with filling. For example, you may fill categorical fields with “unknown” and keep numeric fields for analysis.
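Here is a minimal sketch of that pattern, using a made-up DataFrame with a text field "category" and a numeric field "price":
import pandas as pd

# Hypothetical scraped records with gaps in both a text field and a numeric field
df = pd.DataFrame({
    "category": ["books", None, "toys"],
    "price": [12.5, 7.0, None]
})

df["category"] = df["category"].fillna("unknown")   # fill the categorical field
df = df.dropna(subset=["price"])                    # drop rows missing the numeric field
print(df)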
9.4 Handle duplicate values
Scraped data often contains duplicates because pages repeat items or APIs return overlapping results. So you should detect and remove duplicates.
Detect duplicates with duplicated():
import pandas as pd
df = pd.DataFrame({
"A": [1, 2, 2, 4],
"B": [5, 6, 6, 8]
})
print(df.duplicated())
Drop duplicates with drop_duplicates():
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
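In scraping, duplicates are often defined by a key such as a URL or product ID rather than by the whole row. A sketch of that case, assuming a hypothetical "url" column:
import pandas as pd

# Hypothetical crawl results where the same URL was fetched twice with different prices
df = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/a", "https://example.com/b"],
    "price": [10, 12, 8]
})

# Keep one row per URL; keep="last" retains the most recently scraped value
df_unique = df.drop_duplicates(subset="url", keep="last")
print(df_unique)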
9.5 Common string and type conversions
Crawler outputs frequently store numbers and time as strings. Therefore, you should normalize them before analysis.
Useful pandas operations:
- pd.to_datetime() to parse time fields
- str.strip() to remove leading/trailing whitespace
- str.lower() to normalize case
- str.replace() to standardize values
Example:
import pandas as pd
df = pd.DataFrame({
"A": ["1", "2", "3", "4"],
"B": ["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"],
"C": [" hello ", " world ", " old_value ", " new_value "]
})
df["A"] = df["A"].astype(float)
df["B"] = pd.to_datetime(df["B"], errors="coerce")
df["C"] = df["C"].str.strip()
df["C"] = df["C"].str.lower()
df["C"] = df["C"].replace({"old_value": "new_value"})
print(df)
9.6 Regular expressions on columns
Regex is common in scraping, especially for extracting and standardizing text.
Extract with regex
df["phone_area"] = df["phone"].str.extract(r"\((\d{3})\)")
Vectorized string concatenation
df["name"] = df["first_name"].str.cat(df["last_name"], sep=" ")
Fuzzy matching (optional)
If you need approximate matching, you can use fuzzywuzzy (its maintained successor is published as thefuzz):
pip install fuzzywuzzy
from fuzzywuzzy import fuzz
df["similarity"] = df.apply(lambda x: fuzz.ratio(x["name1"], x["name2"]), axis=1)
9.7 High-performance data merging
After pandas data cleaning for web scraping, you often merge datasets, for example:
- user profiles + order lists
- product tables + inventory tables
- company A data + company B data
Fast merging based on index with join():
result = df1.join(df2, how="left")
Example:
import pandas as pd
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=["K0", "K1"])
df2 = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])
print(df1.join(df2))
Merge by key with merge():
result = pd.merge(df1, df2, on="key", how="outer", indicator=True)
The indicator=True flag helps you track which rows come from which table. As a result, debugging merges becomes easier.
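The df1 and df2 above are aligned by index and have no shared "key" column, so here is a separate sketch with hypothetical product and inventory tables joined on a "sku" key:
import pandas as pd

# Hypothetical product and inventory tables that share a "sku" key
products = pd.DataFrame({"sku": ["P1", "P2", "P3"], "name": ["pen", "cup", "hat"]})
inventory = pd.DataFrame({"sku": ["P2", "P3", "P4"], "stock": [5, 0, 12]})

result = pd.merge(products, inventory, on="sku", how="outer", indicator=True)
print(result)  # the "_merge" column shows both / left_only / right_only per row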
Concatenate by axis with concat():
pd.concat([df1, df2], axis=1, keys=["2024", "2025"])
concat() works well when you stack rows or align columns by index. When stacking rows, join="inner" keeps only the columns shared by all inputs, as in the sketch below.
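For example, stacking two monthly scrapes with partially overlapping columns (the frames and column names are made up):
import pandas as pd

# Hypothetical monthly scrapes: February is missing the "rating" column
jan = pd.DataFrame({"url": ["a", "b"], "price": [10, 11], "rating": [4.5, 4.0]})
feb = pd.DataFrame({"url": ["c", "d"], "price": [12, 13]})

# join="inner" keeps only the columns present in every input frame
stacked = pd.concat([jan, feb], axis=0, join="inner", ignore_index=True)
print(stacked)  # columns: url, price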
9.8 Save cleaned data into a database
Once your dataset has no obvious missing values, duplicates, or formatting issues, you can store it.
For example, you can save a DataFrame into SQL using to_sql:
df.to_sql("table_name", con)
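to_sql accepts a SQLAlchemy engine or a plain sqlite3 connection. A minimal sketch using the standard-library sqlite3 module (the file and table names are placeholders):
import sqlite3
import pandas as pd

df = pd.DataFrame({"url": ["a", "b"], "price": [10.0, 12.5]})   # stand-in for your cleaned data

con = sqlite3.connect("scraped.db")                             # file-based SQLite database
df.to_sql("scraped_data", con, if_exists="replace", index=False)

print(pd.read_sql("SELECT * FROM scraped_data", con))           # verify the write
con.close()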
At this point, the stored data is cleaner, more consistent, and ready for downstream queries, dashboards, or APIs.
Summary
This chapter showed a practical workflow for pandas data cleaning for web scraping: extract tables, normalize numeric fields, handle missing/duplicate values, apply string cleanup, use regex, merge datasets, and finally store results into a database. As a result, your crawler output becomes analysis-ready data instead of raw noise.